Assignment 2

Author

Jared Tavares and Heeeyletje van Zyl

Abstract

Introduction

The field of Natural Language Processing (NLP) encompasses techniques tailored for theme tracking and opinion mining, both of which form part of text analysis. Of particular prominence is the extraction of latent thematic patterns and the measurement of the extent of emotionality expressed in political texts.

Given such a political context, it is of specific interest to analyse the annual State of the Nation Address (SONA) speeches delivered by six different South African presidents (F.W. de Klerk, N.R. Mandela, T.M. Mbeki, K.P. Motlanthe, J.G. Zuma, and M.C. Ramaphosa) over twenty-nine years (from 1994 to 2023). This analysis, descriptive and data-driven in nature, endeavours to examine the content of the SONA speeches in terms of themes via topic modelling (TM) and emotions via sentiment analysis (SentA). In general, as illustrated in Figure, this exploration is two-pronged, executing the aforementioned techniques within a macro and micro context at both the text level (all-presidents versus by-president SONA speeches, respectively) and the token level (sentences versus words, respectively).

Schematic representation of sentiment analysis and topic modelling investigated at different scales within different levels.

Through such a multi-layered lens, trends in both topics and sentiments over time can be identified at a large scale (the presidents as a collective) as well as at a small scale (each president as an individual). This provides not only an aggregated perspective of the general political discourse prevailing within South Africa, but also a more focused view of the specific rhetoric employed by each of the country’s serving presidents during different periods.

To achieve the above-mentioned analysis, it is first relevant to revise foundational terms and review related literature in the context of politics and NLP. All pertinent pre-processing of the political text data is then considered, followed by a discussion delving into the details of each SentA and TM approach applied as part of the analysis. Specifically, three different lexicons are leveraged to describe sentiments, whilst five different topic models are applied to uncover themes within the South African presidents’ SONA speeches. Following the implementation of these methodologies, the results are detailed in terms of insights and interpretations. Thereafter, an overall evaluation of the techniques in terms of efficacy and inadequacy is given. Finally, focal findings are highlighted and potential improvements for future research are recommended.

Methods

Topic modelling

Latent Semantic Analysis (LSA)

LSA (Deerwester et al. 1990) is a non-probabilistic, non-generative model in which a form of matrix factorization is utilized to uncover a small number of latent topics that capture meaningful relationships among documents and tokens. As depicted in Figure, in the first step a document-term matrix (DTM) is generated from the raw text data by tokenizing the d documents into w words (or sentences), which form the columns and rows of the matrix respectively. Each entry is weighted via either the bag-of-words (BoW) or tf-idf approach. This DTM, which is typically sparse and high-dimensional, is then decomposed via a dimensionality-reduction technique, namely truncated Singular Value Decomposition (SVD). Consequently, in the second step the DTM becomes the product of three matrices: the topic-word matrix At* (for the tokens), the topic-prevalence matrix Bt* (for the latent semantic factors), and the transposed document-topic matrix CTt* (for the documents). Here t*, the optimal number of topics, is a hyperparameter tuned (via either the silhouette-coefficient or the coherence-measure approach) to a value that retains the most significant dimensions of the transformed space. In the final step, the text data is encoded using this optimal topic number.

Given that LSA involves only a DTM, its implementation is generally efficient. However, the truncated SVD step can be computationally intensive and does not readily accommodate updates with new, incoming text data. Additional LSA drawbacks include a lack of interpretability, the underlying linear-model framework (which results in poor performance on text data with non-linear dependencies), and the underlying Gaussian assumption for tokens in documents (which may not be appropriate for word counts).

Probabilistic Latent Semantic Analysis (pLSA)

Instead of implementing truncated SVD, pLSA (Hofmann 1999) utilizes a generative, probabilistic model. Within this framework, a document d is first selected with probability P(d). A latent topic t present in the selected document d is then chosen with probability P(t|d). Finally, given this chosen topic t, a word w (or sentence) is generated from it with probability P(w|t), as shown in Figure. Note that the values of P(d) are determined directly from the corpus D, which is defined in terms of a DTM. In contrast, the probabilities P(t|d) and P(w|t) are parameters modelled as multinomial distributions and iteratively updated via the Expectation-Maximization (EM) algorithm. A direct parallel between LSA and pLSA can be drawn via the methods’ parameterizations, as conveyed by the matching colours of the topic-word matrix and P(w|t), the document-topic matrix and P(d|t), and the topic-prevalence matrix and P(t) displayed in Figure and Figure, respectively (the latter two pairings correspond to pLSA’s symmetric formulation).
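The EM updates for P(t|d) and P(w|t) can be sketched in NumPy as follows (the count matrix and the choice of two topics are illustrative, not the SONA data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-term count matrix n(d, w): 4 documents, 6 terms (illustrative).
N = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 3, 2, 0, 1],
    [0, 1, 2, 3, 0, 1],
], dtype=float)
D, W = N.shape
T = 2  # number of latent topics

# Random initialization of the two multinomial parameter sets.
p_t_d = rng.random((D, T)); p_t_d /= p_t_d.sum(axis=1, keepdims=True)  # P(t|d)
p_w_t = rng.random((T, W)); p_w_t /= p_w_t.sum(axis=1, keepdims=True)  # P(w|t)

for _ in range(50):
    # E-step: responsibilities P(t|d,w) proportional to P(t|d) P(w|t).
    joint = p_t_d[:, None, :] * p_w_t.T[None, :, :]   # shape (D, W, T)
    resp = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate the multinomials from expected counts.
    weighted = N[:, :, None] * resp                   # n(d,w) * P(t|d,w)
    p_w_t = weighted.sum(axis=0).T                    # (T, W)
    p_w_t /= p_w_t.sum(axis=1, keepdims=True)
    p_t_d = weighted.sum(axis=1)                      # (D, T)
    p_t_d /= p_t_d.sum(axis=1, keepdims=True)

print(np.round(p_t_d, 2))  # each row is a valid topic mixture (sums to 1)
```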

Despite pLSA implicitly addressing these LSA-related disadvantages, the method still has two main drawbacks. There is no generative model for the document-topic probabilities P(t|d), so topic mixtures cannot be assigned to new, unseen documents. The number of model parameters also grows linearly with the number of documents, making the method more susceptible to overfitting.

Latent Dirichlet Allocation (LDA)

Schematic representation of LDA.

LDA (Blei, Ng, and Jordan 2003) is another generative, probabilistic model, which can be viewed as a hierarchical Bayesian version of pLSA. By explicitly defining a generative model for the document-topic probabilities, both of the above-mentioned pitfalls of pLSA are addressed: the number of parameters to estimate decreases drastically, and the model can be applied and generalized to new, unseen documents. As presented in Figure, the first two steps involve randomly sampling a document-topic probability distribution (\(\theta\)) from a Dirichlet (Dir) distribution (with parameter \(\eta\)), followed by randomly sampling a topic-word probability distribution (\(\phi\)) from another Dirichlet distribution (with parameter \(\tau\)). From the \(\theta\) distribution, a topic t is selected by drawing from a multinomial (Mult) distribution (third step), and from the \(\phi\) distribution given said topic t, a word w (or sentence) is sampled from another multinomial distribution (fourth step). The associated LDA parameters are then estimated via a variational expectation-maximization algorithm or collapsed Gibbs sampling.

Correlated Topic Model (CTM)

Closely following LDA, the CTM (Lafferty and Blei 2005) additionally models the presence of correlated topics. Such topic correlations are introduced via a multivariate normal (MultNorm) distribution with a length-t mean vector (\(\mu\)) and a t \(\times\) t covariance matrix (\(\Sigma\)), whose draws are then mapped into probabilities through a logistic transformation. Comparing Figure and Figure, the nuance between LDA and CTM is highlighted in light blue: the models differ in that the Dirichlet distribution (which implicitly assumes independence across topics) is replaced with the logistic-normal distribution (which explicitly enables topic dependency via a covariance structure) for generating the document-topic probabilities. The other generative steps previously outlined for LDA are retained for CTM. Given this additional model complexity, the more involved mean-field variational inference algorithm is employed for CTM parameter estimation, which necessitates many iterations for optimization. CTM is consequently computationally more expensive than LDA, though this snag is far outweighed by the procurement of richer topics with the relationships between them acknowledged explicitly.
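The key difference, swapping the Dirichlet draw for a logistic-normal draw, can be illustrated with a short NumPy sketch (the mean vector and covariance matrix below are arbitrary illustrative values, not fitted CTM parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
t = 3  # number of topics

# CTM step: eta ~ MultNorm(mu, Sigma), mapped to the simplex via softmax.
mu = np.zeros(t)
Sigma = np.array([[1.0, 0.8, -0.5],   # off-diagonals encode topic correlation
                  [0.8, 1.0, -0.3],
                  [-0.5, -0.3, 1.0]])
eta = rng.multivariate_normal(mu, Sigma)
theta_ctm = np.exp(eta) / np.exp(eta).sum()   # correlated topic proportions

# LDA's equivalent step draws theta directly from a Dirichlet, which cannot
# encode correlation between topics beyond the shared normalization.
theta_lda = rng.dirichlet(np.ones(t))

print(theta_ctm.sum(), theta_lda.sum())  # both lie on the probability simplex
```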

Read in the data

Exploratory Data Analysis

Sentiment analysis

Topic modelling

LSA

(0, '0.267*"year" + 0.242*"government" + 0.198*"work" + 0.195*"south" + 0.188*"people" + 0.163*"country" + 0.145*"development" + 0.142*"national" + 0.140*"programme" + 0.134*"african"')
(1, '-0.169*"government" + 0.146*"south" + -0.142*"regard" + 0.135*"year" + -0.134*"people" + 0.115*"energy" + 0.114*"000" + -0.113*"shall" + -0.112*"ensure" + -0.102*"question"')
(2, '-0.140*"honourable" + -0.131*"programme" + 0.125*"pandemic" + -0.123*"continue" + 0.115*"new" + -0.110*"development" + -0.109*"rand" + 0.107*"great" + -0.106*"compatriot" + 0.102*"investment"')
(3, '-0.337*"alliance" + -0.240*"transitional" + -0.204*"party" + -0.204*"constitution" + -0.156*"zulu" + -0.155*"constitutional" + -0.131*"south" + -0.126*"concern" + -0.125*"election" + -0.122*"freedom"')
(4, '0.219*"shall" + -0.204*"people" + 0.148*"year" + -0.144*"alliance" + 0.130*"start" + -0.101*"government" + -0.097*"address" + -0.093*"transitional" + 0.088*"community" + 0.088*"citizen"')

pLSA (Probabilistic Latent Semantic Analysis)

[(0,
  '0.000*"year" + 0.000*"government" + 0.000*"people" + 0.000*"work" + 0.000*"south" + 0.000*"development" + 0.000*"programme" + 0.000*"african" + 0.000*"make" + 0.000*"country"'),
 (1,
  '0.001*"year" + 0.001*"government" + 0.001*"south" + 0.001*"work" + 0.001*"country" + 0.001*"people" + 0.000*"african" + 0.000*"national" + 0.000*"development" + 0.000*"make"'),
 (2,
  '0.001*"government" + 0.001*"year" + 0.001*"work" + 0.001*"south" + 0.001*"development" + 0.001*"programme" + 0.001*"country" + 0.001*"people" + 0.001*"national" + 0.001*"africa"'),
 (3,
  '0.001*"year" + 0.001*"government" + 0.001*"people" + 0.001*"south" + 0.001*"country" + 0.001*"work" + 0.001*"public" + 0.000*"national" + 0.000*"african" + 0.000*"programme"'),
 (4,
  '0.000*"year" + 0.000*"work" + 0.000*"south" + 0.000*"people" + 0.000*"government" + 0.000*"new" + 0.000*"national" + 0.000*"programme" + 0.000*"african" + 0.000*"country"')]

LDA (Latent Dirichlet Allocation)

[(0,
  '0.001*"year" + 0.001*"government" + 0.001*"work" + 0.001*"south" + 0.000*"country" + 0.000*"people" + 0.000*"african" + 0.000*"make" + 0.000*"national" + 0.000*"programme"'),
 (1,
  '0.001*"government" + 0.001*"people" + 0.001*"year" + 0.001*"country" + 0.001*"development" + 0.001*"south" + 0.001*"work" + 0.001*"national" + 0.001*"ensure" + 0.001*"programme"'),
 (2,
  '0.000*"year" + 0.000*"people" + 0.000*"government" + 0.000*"work" + 0.000*"south" + 0.000*"country" + 0.000*"national" + 0.000*"africa" + 0.000*"african" + 0.000*"new"'),
 (3,
  '0.001*"year" + 0.001*"work" + 0.001*"south" + 0.001*"government" + 0.000*"national" + 0.000*"development" + 0.000*"programme" + 0.000*"african" + 0.000*"country" + 0.000*"people"'),
 (4,
  '0.001*"government" + 0.001*"year" + 0.001*"people" + 0.001*"south" + 0.001*"work" + 0.000*"country" + 0.000*"african" + 0.000*"make" + 0.000*"programme" + 0.000*"public"')]

CTM (Correlated Topic Model)

Iteration: 0    Log-likelihood: -6.806060638136552
Iteration: 1    Log-likelihood: -6.4991008022521966
Iteration: 2    Log-likelihood: -6.369259739371707
Iteration: 3    Log-likelihood: -6.26304785681647
Iteration: 4    Log-likelihood: -6.185946296974337
... (iterations 5-98 omitted; the log-likelihood plateaus around -5.89 after roughly 30 iterations)
Iteration: 99   Log-likelihood: -5.898059322729286
Topic #0: [('make', 0.04927268996834755), ('new', 0.04468926414847374), ('need', 0.03552241623401642), ('investment', 0.027677711099386215), ('society', 0.025562284514307976), ('project', 0.02212471514940262), ('president', 0.018510861322283745), ('issue', 0.01833457686007023), ('policy', 0.0157784353941679), ('shall', 0.01516143698245287)]
Topic #1: [('year', 0.08232076466083527), ('south', 0.07741256803274155), ('improve', 0.03608040139079094), ('area', 0.026780663058161736), ('nation', 0.024886270985007286), ('000', 0.017050379887223244), ('high', 0.017050379887223244), ('opportunity', 0.015069880522787571), ('world', 0.012572729028761387), ('black', 0.012228294275701046)]
Topic #2: [('public', 0.04011192545294762), ('challenge', 0.023158671334385872), ('local', 0.01992531679570675), ('level', 0.017391066998243332), ('support', 0.016604576259851456), ('member', 0.016167636960744858), ('health', 0.016167636960744858), ('change', 0.015206369571387768), ('like', 0.01485681813210249), ('institution', 0.01476943027228117)]
Topic #3: [('service', 0.043100398033857346), ('economic', 0.038134198635816574), ('million', 0.02527529187500477), ('process', 0.02359033189713955), ('progress', 0.019954362884163857), ('education', 0.018003355711698532), ('achieve', 0.01711653545498848), ('start', 0.016141030937433243), ('parliament', 0.015431574545800686), ('department', 0.014278707094490528)]
Topic #4: [('work', 0.07791358232498169), ('african', 0.056609995663166046), ('national', 0.05033918097615242), ('water', 0.01846969686448574), ('land', 0.018297893926501274), ('important', 0.017438877373933792), ('past', 0.01735297590494156), ('action', 0.01606445200741291), ('resource', 0.015892649069428444), ('world', 0.013831011950969696)]
Topic #5: [('people', 0.07293239235877991), ('africa', 0.047780591994524), ('include', 0.03446493297815323), ('effort', 0.020191935822367668), ('honourable', 0.019669752568006516), ('infrastructure', 0.018190234899520874), ('child', 0.01801617443561554), ('critical', 0.014099803753197193), ('labour', 0.012707316316664219), ('rate', 0.012098102830350399)]
Topic #6: [('development', 0.060912903398275375), ('ensure', 0.04804292321205139), ('business', 0.0312943235039711), ('regard', 0.027591999620199203), ('far', 0.02697494626045227), ('address', 0.025917140766978264), ('place', 0.015956128016114235), ('matter', 0.015339074656367302), ('democratic', 0.013840515166521072), ('training', 0.01154860109090805)]
Topic #7: [('government', 0.10281074792146683), ('social', 0.03444657847285271), ('growth', 0.029437894001603127), ('life', 0.026801742613315582), ('provide', 0.024868566542863846), ('security', 0.021090082824230194), ('woman', 0.017311599105596542), ('implement', 0.017047984525561333), ('time', 0.016960112378001213), ('freedom', 0.014148219488561153)]
Topic #8: [('programme', 0.05629885569214821), ('u', 0.04766419529914856), ('increase', 0.02106943540275097), ('continue', 0.016924798488616943), ('say', 0.016665758565068245), ('right', 0.015456906519830227), ('building', 0.013989013619720936), ('speaker', 0.013816320337355137), ('come', 0.013816320337355137), ('establish', 0.013039201498031616)]
Topic #9: [('country', 0.06946097314357758), ('sector', 0.0425364151597023), ('economy', 0.035520244389772415), ('year', 0.031135136261582375), ('community', 0.025960711762309074), ('plan', 0.02341734804213047), ('create', 0.02324194461107254), ('state', 0.020172368735074997), ('implementation', 0.017453601583838463), ('capacity', 0.017015092074871063)]

ATM (Author-Topic Model)

[(0,
  '0.011*"year" + 0.011*"government" + 0.009*"work" + 0.008*"country" + 0.008*"south" + 0.007*"people" + 0.007*"national" + 0.006*"development" + 0.006*"make" + 0.005*"programme"'),
 (1,
  '0.011*"year" + 0.010*"government" + 0.009*"south" + 0.007*"african" + 0.006*"programme" + 0.006*"work" + 0.006*"people" + 0.006*"national" + 0.005*"development" + 0.005*"continue"'),
 (2,
  '0.009*"year" + 0.009*"south" + 0.008*"people" + 0.006*"work" + 0.006*"government" + 0.005*"africa" + 0.005*"african" + 0.005*"country" + 0.005*"development" + 0.005*"ensure"'),
 (3,
  '0.008*"year" + 0.008*"people" + 0.007*"work" + 0.006*"development" + 0.006*"south" + 0.006*"government" + 0.006*"africa" + 0.006*"country" + 0.005*"programme" + 0.005*"continue"'),
 (4,
  '0.012*"year" + 0.011*"government" + 0.009*"people" + 0.008*"south" + 0.008*"work" + 0.007*"country" + 0.006*"national" + 0.006*"african" + 0.006*"programme" + 0.006*"ensure"')]

References

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.
Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41 (6): 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
Hofmann, Thomas. 1999. “Probabilistic Latent Semantic Indexing.” In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 50–57. SIGIR ’99. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/312624.312649.
Lafferty, John, and David Blei. 2005. “Correlated Topic Models.” In Advances in Neural Information Processing Systems, edited by Y. Weiss, B. Schölkopf, and J. Platt. Vol. 18. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2005/file/9e82757e9a1c12cb710ad680db11f6f1-Paper.pdf.